My Netflix's viewing activities¶
Data Skills¶
- Data Transformation: change data type (using datetime), calculate new metrics, extract strings (using regex)
- Data Aggregation (grouping and summary)
- Data Visualization (using matplotlib)
Questions to Answer¶
- Which user watches the most on Netflix?
- How much time does each user spend on average?
- Do two users watch the same show?
- What are the binge watching records for a particular user?
In [ ]:
import pandas as pd
import datetime
df = pd.read_csv('netflix-report\CONTENT_INTERACTION\ViewingActivity.csv')
Data Exploration¶
In [ ]:
# Understand data structure: 6113 rows and 10 columns
df.shape
Out[ ]:
(6113, 10)
In [ ]:
# Preview data
df.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country | |
---|---|---|---|---|---|---|---|---|---|---|
0 | AlyCha | 2024-08-09 05:24:34 | 00:10:41 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:22:34 | 00:22:34 | VN (Viet Nam) |
1 | AlyCha | 2024-08-09 01:41:23 | 00:02:23 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:11:53 | Not latest view | VN (Viet Nam) |
2 | AlyCha | 2024-08-09 01:28:07 | 00:09:10 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:09:29 | Not latest view | VN (Viet Nam) |
3 | AlyCha | 2024-08-09 01:09:56 | 00:18:59 | NaN | The Big Bang Theory: Season 12: The Change Con... | NaN | iPhone 12 | 00:19:19 | 00:19:19 | VN (Viet Nam) |
4 | AlyCha | 2024-08-09 00:23:03 | 00:19:36 | NaN | The Big Bang Theory: Season 12: The Maternal C... | NaN | iPhone 12 | 00:19:56 | 00:19:56 | VN (Viet Nam) |
In [ ]:
# See unique profile names
df["Profile Name"].unique()
Out[ ]:
array(['AlyCha', 'Chi', 'Mom'], dtype=object)
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6113 entries, 0 to 6112 Data columns (total 10 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Profile Name 6113 non-null object 1 Start Time 6113 non-null object 2 Duration 6113 non-null object 3 Attributes 1836 non-null object 4 Title 6113 non-null object 5 Supplemental Video Type 1124 non-null object 6 Device Type 6113 non-null object 7 Bookmark 6113 non-null object 8 Latest Bookmark 6113 non-null object 9 Country 6113 non-null object dtypes: object(10) memory usage: 477.7+ KB
Data Transformation¶
In [ ]:
# Transform Start Time to datetime format and verify
df['Start Time'] = pd.to_datetime(df['Start Time'], utc=True)
df.dtypes
Out[ ]:
Profile Name object Start Time datetime64[ns, UTC] Duration object Attributes object Title object Supplemental Video Type object Device Type object Bookmark object Latest Bookmark object Country object dtype: object
In [ ]:
# Transform Duration to seconds
df['Hours'] = df['Duration'].str.slice(0,2).astype('int64')
df['Minutes'] = df['Duration'].str.slice(3,5).astype('int64')
df['Seconds'] = df['Duration'].str.slice(6,8).astype('int64')
df['Duration'] = df['Hours'] * 3600 + df['Minutes'] * 60 + df['Seconds']
In [ ]:
# Remove helper columns
df = df.drop(columns=['Hours','Minutes', 'Seconds'], axis=1)
df.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Supplemental Video Type | Device Type | Bookmark | Latest Bookmark | Country | |
---|---|---|---|---|---|---|---|---|---|---|
0 | AlyCha | 2024-08-09 05:24:34+00:00 | 641 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:22:34 | 00:22:34 | VN (Viet Nam) |
1 | AlyCha | 2024-08-09 01:41:23+00:00 | 143 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:11:53 | Not latest view | VN (Viet Nam) |
2 | AlyCha | 2024-08-09 01:28:07+00:00 | 550 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | NaN | iPhone 12 | 00:09:29 | Not latest view | VN (Viet Nam) |
3 | AlyCha | 2024-08-09 01:09:56+00:00 | 1139 | NaN | The Big Bang Theory: Season 12: The Change Con... | NaN | iPhone 12 | 00:19:19 | 00:19:19 | VN (Viet Nam) |
4 | AlyCha | 2024-08-09 00:23:03+00:00 | 1176 | NaN | The Big Bang Theory: Season 12: The Maternal C... | NaN | iPhone 12 | 00:19:56 | 00:19:56 | VN (Viet Nam) |
Data Analysis¶
Q1: Which profile has the most viewing activities?
In [ ]:
df['Profile Name'].value_counts()
# Alyssa is the winner
Out[ ]:
Profile Name AlyCha 2584 Mom 1940 Chi 1589 Name: count, dtype: int64
Q2: Which profile watches the most time?
In [ ]:
# To put into context, the data was collected from Oct 2020 through Aug 2024 (~ 4 years)
df['Start Time'].agg(['min', 'max'])
Out[ ]:
min 2020-10-17 14:11:50+00:00 max 2024-08-15 11:30:14+00:00 Name: Start Time, dtype: datetime64[ns, UTC]
In [ ]:
viewTime = df.groupby('Profile Name')['Duration'].sum()
viewTime = viewTime.to_frame().reset_index()
viewTime['Duration'] = viewTime['Duration']/3600
viewTime['Duration'] = viewTime['Duration'].round().astype('int64')
viewTime.sort_values(by='Duration', ascending=False)
# Mom watches the most (contrary to what she claimed :))
Out[ ]:
Profile Name | Duration | |
---|---|---|
2 | Mom | 842 |
0 | AlyCha | 564 |
1 | Chi | 392 |
In [ ]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
# Visualize data using bar chart
viewTime.sort_values(by='Duration', ascending=False).plot(kind="bar", x="Profile Name", xlabel='', y='Duration', ylabel='Hours Watched', title="Watch time per Profile", legend=False, color='red', figsize=(4,3))
plt.xticks(rotation=360, horizontalalignment="center")
plt.gca().set_facecolor('black')
Q3: Find mutual titles that Chi and Alyssa watched
In [ ]:
# Compile a list of unique titles that Alyssa watched
Aly_shows = df[(df['Profile Name'] == 'AlyCha') & (df['Supplemental Video Type'].isnull())]['Title'].unique()
In [ ]:
# Compile a list of unique titles that Chi watched
Chi_shows = df[(df['Profile Name'] == 'Chi') & (df['Supplemental Video Type'].isnull())]['Title'].unique()
In [ ]:
# Loop through two lists to find overlapping titles
mutual_shows = []
for i in Chi_shows:
for j in Aly_shows:
if i == j:
mutual_shows.append(i)
print(f'Alyssa and Chi have watched {len(mutual_shows)} titles in common.')
print('Some of them are:')
import random
for i in random.sample(mutual_shows, 5):
print(f'* {i}')
Alyssa and Chi have watched 236 titles in common. Some of them are: * The Big Bang Theory: Season 6: The 43 Peculiarity (Episode 8) * The Big Bang Theory: Season 7: The Proton Transmogrification (Episode 22) * SPY x FAMILY: Season 1: Operation Strix (Episode 1) * The Big Bang Theory: Season 7: The Locomotive Manipulation (Episode 15) * The Big Bang Theory: Season 10: The Romance Recalibration (Episode 13)
Q4: Show average watch time per user excluding trailers and hooks
In [ ]:
# Extract only true watching activities (excluding trailers and hooks)
watch = df[df['Supplemental Video Type'].isnull()]
watch = watch.drop('Supplemental Video Type', axis=1)
watch.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Device Type | Bookmark | Latest Bookmark | Country | |
---|---|---|---|---|---|---|---|---|---|
0 | AlyCha | 2024-08-09 05:24:34+00:00 | 641 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | iPhone 12 | 00:22:34 | 00:22:34 | VN (Viet Nam) |
1 | AlyCha | 2024-08-09 01:41:23+00:00 | 143 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | iPhone 12 | 00:11:53 | Not latest view | VN (Viet Nam) |
2 | AlyCha | 2024-08-09 01:28:07+00:00 | 550 | NaN | The Big Bang Theory: Season 12: The Stockholm ... | iPhone 12 | 00:09:29 | Not latest view | VN (Viet Nam) |
3 | AlyCha | 2024-08-09 01:09:56+00:00 | 1139 | NaN | The Big Bang Theory: Season 12: The Change Con... | iPhone 12 | 00:19:19 | 00:19:19 | VN (Viet Nam) |
4 | AlyCha | 2024-08-09 00:23:03+00:00 | 1176 | NaN | The Big Bang Theory: Season 12: The Maternal C... | iPhone 12 | 00:19:56 | 00:19:56 | VN (Viet Nam) |
In [ ]:
# Group watchtime by Profile and show average watchtime in minutes
summary = watch.groupby('Profile Name')['Duration'].mean().round()
summary = summary.to_frame().reset_index()
summary['Avg Duration (min)'] = summary['Duration']/60
summary['Avg Duration (min)'] = summary['Avg Duration (min)'].round().astype('int')
summary = summary[['Profile Name', 'Avg Duration (min)']].sort_values(by='Avg Duration (min)', ascending=False)
summary
Out[ ]:
Profile Name | Avg Duration (min) | |
---|---|---|
2 | Mom | 29 |
1 | Chi | 19 |
0 | AlyCha | 17 |
In [ ]:
# Visualize results in bar chart
summary.plot(kind='bar', x='Profile Name', y = 'Avg Duration (min)', xlabel='', ylabel='Minutes', title=" Average Watch time per Profile", legend=False, color='red', figsize=(4,3))
plt.xticks(rotation=360, horizontalalignment="center")
plt.gca().set_facecolor('black')
Q5: Binge records for Chi
Define binge watching: Watch 3+ episodes of the same series in the same day
In [ ]:
# Create a new dataframe for Chi's viewing activities
chi = watch[watch['Profile Name'] == 'Chi']
chi.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Device Type | Bookmark | Latest Bookmark | Country | |
---|---|---|---|---|---|---|---|---|---|
2584 | Chi | 2024-08-15 11:30:14+00:00 | 2765 | NaN | Downton Abbey: Series 2: Episode 5 | DefaultWidevineAndroidPhone | 00:54:45 | 00:54:45 | FR (France) |
2585 | Chi | 2024-08-15 10:18:36+00:00 | 3078 | NaN | Downton Abbey: Series 2: Episode 4 | Chrome PC (Cadmium) | 00:53:40 | 00:53:40 | FR (France) |
2591 | Chi | 2024-08-14 17:17:43+00:00 | 1273 | NaN | The Umbrella Academy: Season 4: End of the Beg... | Chrome PC (Cadmium) | 01:07:17 | 01:07:17 | FR (France) |
2593 | Chi | 2024-08-14 11:32:33+00:00 | 2521 | NaN | The Umbrella Academy: Season 4: End of the Beg... | DefaultWidevineAndroidPhone | 00:43:48 | Not latest view | FR (France) |
2594 | Chi | 2024-08-14 11:02:00+00:00 | 1826 | NaN | The Umbrella Academy: Season 4: Six Years, Fiv... | DefaultWidevineAndroidPhone | 00:54:52 | 00:54:52 | FR (France) |
In [ ]:
# Extract only show title
chi.loc[:,'Show'] = chi['Title'].str.extract(r'(.+?(?=:))')
chi.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Device Type | Bookmark | Latest Bookmark | Country | Show | |
---|---|---|---|---|---|---|---|---|---|---|
2584 | Chi | 2024-08-15 11:30:14+00:00 | 2765 | NaN | Downton Abbey: Series 2: Episode 5 | DefaultWidevineAndroidPhone | 00:54:45 | 00:54:45 | FR (France) | Downton Abbey |
2585 | Chi | 2024-08-15 10:18:36+00:00 | 3078 | NaN | Downton Abbey: Series 2: Episode 4 | Chrome PC (Cadmium) | 00:53:40 | 00:53:40 | FR (France) | Downton Abbey |
2591 | Chi | 2024-08-14 17:17:43+00:00 | 1273 | NaN | The Umbrella Academy: Season 4: End of the Beg... | Chrome PC (Cadmium) | 01:07:17 | 01:07:17 | FR (France) | The Umbrella Academy |
2593 | Chi | 2024-08-14 11:32:33+00:00 | 2521 | NaN | The Umbrella Academy: Season 4: End of the Beg... | DefaultWidevineAndroidPhone | 00:43:48 | Not latest view | FR (France) | The Umbrella Academy |
2594 | Chi | 2024-08-14 11:02:00+00:00 | 1826 | NaN | The Umbrella Academy: Season 4: Six Years, Fiv... | DefaultWidevineAndroidPhone | 00:54:52 | 00:54:52 | FR (France) | The Umbrella Academy |
In [ ]:
# Extract date from watch time
from datetime import datetime
chi.loc[:,'Date'] = chi['Start Time'].dt.date
chi.head()
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Device Type | Bookmark | Latest Bookmark | Country | Show | Date | |
---|---|---|---|---|---|---|---|---|---|---|---|
2584 | Chi | 2024-08-15 11:30:14+00:00 | 2765 | NaN | Downton Abbey: Series 2: Episode 5 | DefaultWidevineAndroidPhone | 00:54:45 | 00:54:45 | FR (France) | Downton Abbey | 2024-08-15 |
2585 | Chi | 2024-08-15 10:18:36+00:00 | 3078 | NaN | Downton Abbey: Series 2: Episode 4 | Chrome PC (Cadmium) | 00:53:40 | 00:53:40 | FR (France) | Downton Abbey | 2024-08-15 |
2591 | Chi | 2024-08-14 17:17:43+00:00 | 1273 | NaN | The Umbrella Academy: Season 4: End of the Beg... | Chrome PC (Cadmium) | 01:07:17 | 01:07:17 | FR (France) | The Umbrella Academy | 2024-08-14 |
2593 | Chi | 2024-08-14 11:32:33+00:00 | 2521 | NaN | The Umbrella Academy: Season 4: End of the Beg... | DefaultWidevineAndroidPhone | 00:43:48 | Not latest view | FR (France) | The Umbrella Academy | 2024-08-14 |
2594 | Chi | 2024-08-14 11:02:00+00:00 | 1826 | NaN | The Umbrella Academy: Season 4: Six Years, Fiv... | DefaultWidevineAndroidPhone | 00:54:52 | 00:54:52 | FR (France) | The Umbrella Academy | 2024-08-14 |
In [ ]:
chi[chi['Show']=='The Good Doctor'].sort_values(by='Date').head()
# After examination, I realized that each row records a time I opened the player, so it counts in the times when I had to reopen the video or switch to another device
# Solution: filter only sessions with watch time > 15 minutes (900s)
Out[ ]:
Profile Name | Start Time | Duration | Attributes | Title | Device Type | Bookmark | Latest Bookmark | Country | Show | Date | |
---|---|---|---|---|---|---|---|---|---|---|---|
4172 | Chi | 2021-07-25 15:07:29+00:00 | 116 | NaN | The Good Doctor: Season 1: Burnt Food (Episode 1) | Chrome PC (Cadmium) | 00:01:56 | Not latest view | VN (Viet Nam) | The Good Doctor | 2021-07-25 |
4171 | Chi | 2021-07-25 15:18:20+00:00 | 151 | NaN | The Good Doctor: Season 1: Burnt Food (Episode 1) | DefaultWidevineAndroidTablets | 00:04:36 | Not latest view | VN (Viet Nam) | The Good Doctor | 2021-07-25 |
4163 | Chi | 2021-08-16 16:24:12+00:00 | 91 | NaN | The Good Doctor: Season 1: Burnt Food (Episode 1) | Google Smart Display Nest Hub 2018 | 00:13:48 | Not latest view | VN (Viet Nam) | The Good Doctor | 2021-08-16 |
4162 | Chi | 2021-08-16 16:26:20+00:00 | 4 | NaN | The Good Doctor: Season 1: Burnt Food (Episode 1) | Google Smart Display Nest Hub 2018 | 00:13:42 | Not latest view | VN (Viet Nam) | The Good Doctor | 2021-08-16 |
4161 | Chi | 2021-08-16 16:27:12+00:00 | 8 | NaN | The Good Doctor: Season 1: Burnt Food (Episode 1) | DefaultWidevineAndroidTablets | 00:13:55 | 00:13:55 | VN (Viet Nam) | The Good Doctor | 2021-08-16 |
In [ ]:
# Extract an aggregated summary of watches per show
binge = chi[['Date','Show','Duration']][chi['Duration'] > 900]
binge = binge.groupby(['Date','Show']).agg(Count = ('Show','count'), TotalTime = ('Duration','sum')).reset_index()
binge.head()
Out[ ]:
Date | Show | Count | TotalTime | |
---|---|---|---|---|
0 | 2021-08-16 | The Good Doctor | 1 | 2531 |
1 | 2021-08-18 | The Good Doctor | 2 | 5019 |
2 | 2021-08-19 | The Good Doctor | 1 | 2558 |
3 | 2021-08-20 | The Good Doctor | 2 | 4940 |
4 | 2021-08-21 | The Good Doctor | 3 | 5094 |
In [ ]:
#Top 10 binge by number of episodes watched in a day
Top10byEp = binge.sort_values(by='Count', ascending=False).head(10).reset_index(drop=True)
Top10byEp.rename(columns={'Count':'No.Episodes'}, inplace=True)
Top10byEp = Top10byEp.drop('TotalTime', axis=1)
Top10byEp
Out[ ]:
Date | Show | No.Episodes | |
---|---|---|---|
0 | 2021-09-10 | The Big Bang Theory | 6 |
1 | 2023-02-21 | The Hook Up Plan | 5 |
2 | 2021-10-08 | The Big Bang Theory | 5 |
3 | 2021-09-17 | The Big Bang Theory | 5 |
4 | 2021-09-27 | The Big Bang Theory | 5 |
5 | 2022-01-10 | The Big Bang Theory | 5 |
6 | 2022-04-24 | Heartstopper | 5 |
7 | 2021-10-02 | The Big Bang Theory | 5 |
8 | 2021-08-29 | Lupin | 5 |
9 | 2021-10-10 | The Big Bang Theory | 5 |
In [ ]:
#Top 10 binge by total time watched in a day (in hours)
Top10byTime = binge.sort_values(by='TotalTime', ascending=False).head(10).reset_index(drop=True)
Top10byTime['TotalTime'] = (Top10byTime['TotalTime']/3600).round(1).astype('float')
Top10byTime
Out[ ]:
Date | Show | Count | TotalTime | |
---|---|---|---|---|
0 | 2021-08-29 | Lupin | 5 | 3.0 |
1 | 2023-10-06 | Sex Education | 4 | 3.0 |
2 | 2022-06-13 | Venom | 3 | 2.6 |
3 | 2021-09-02 | Bridgerton | 3 | 2.6 |
4 | 2023-10-09 | Sex Education | 3 | 2.4 |
5 | 2024-06-17 | Bridgerton | 3 | 2.3 |
6 | 2021-09-05 | Bridgerton | 4 | 2.3 |
7 | 2024-08-14 | The Umbrella Academy | 5 | 2.3 |
8 | 2022-07-03 | Stranger Things | 2 | 2.3 |
9 | 2021-08-28 | Lupin | 3 | 2.2 |